feat(cluster): authenticate replica peers with PSK + BLAKE3 handshake #3425

Merged: hubcio merged 2 commits into master from auth-in-cluster on Jun 10, 2026
@hubcio hubcio commented Jun 5, 2026

Contributor

The server-ng replica port (tcp_replica) was plaintext: any TCP peer
that learned the cluster id could inject VSR frames or register as an
arbitrary replica. mTLS does not fit - the replica conn is a dup'd
plaintext fd round-robined across shards, state rustls cannot carry.

Authenticate with a pre-shared cluster key and a 3-message mutual
BLAKE3 keyed-MAC handshake (ReplicaHello / ReplicaChallenge /
ReplicaFinish) over the reserved GenericHeader bytes: no new typed
header, the stream stays a dupable plaintext fd. These are dedicated
Command2 variants, not a Ping/Pong squat (left free for a future VSR
liveness ping). ReplicaChallenge carries a status field, so a reject is
the same frame with a nonzero status and the finish is identified by
command, not position. The MAC proves PSK possession (cluster
membership, not per-replica identity - the registry still trusts the
announced id, so keep the port on a trusted boundary). The PSK is never
serialized to disk.
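
The three-message exchange described above can be sketched as follows. This is a minimal illustration, not the PR's wire format: the message fields, the nonce sizes, the `b"responder"` / `b"initiator"` domain labels, the `STATUS_REJECT` value, and every function name are assumptions. The Python standard library has no BLAKE3, so BLAKE2b in keyed mode stands in for the BLAKE3 keyed MAC.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass

STATUS_OK = 0
STATUS_REJECT = 1  # hypothetical nonzero reject code


def keyed_mac(psk: bytes, *parts: bytes) -> bytes:
    """Keyed MAC over length-prefixed parts.

    Stand-in: BLAKE2b keyed mode, NOT the BLAKE3 keyed hash the PR uses.
    """
    h = hashlib.blake2b(key=psk, digest_size=32)
    for part in parts:
        h.update(len(part).to_bytes(4, "little"))
        h.update(part)
    return h.digest()


@dataclass
class ReplicaHello:
    cluster_id: bytes
    replica_id: int
    nonce: bytes


@dataclass
class ReplicaChallenge:
    status: int  # nonzero means "rejected", same frame type
    nonce: bytes
    mac: bytes


@dataclass
class ReplicaFinish:
    mac: bytes


def make_hello(cluster_id: bytes, replica_id: int) -> ReplicaHello:
    return ReplicaHello(cluster_id, replica_id, os.urandom(32))


def answer_hello(psk: bytes, cluster_id: bytes, hello: ReplicaHello) -> ReplicaChallenge:
    """Responder: reject a foreign cluster, else prove PSK possession."""
    if hello.cluster_id != cluster_id:
        return ReplicaChallenge(STATUS_REJECT, b"", b"")
    nonce = os.urandom(32)
    mac = keyed_mac(psk, b"responder", cluster_id, hello.nonce, nonce)
    return ReplicaChallenge(STATUS_OK, nonce, mac)


def answer_challenge(psk: bytes, hello: ReplicaHello, challenge: ReplicaChallenge):
    """Initiator: verify the responder's MAC, then return a Finish (or None)."""
    if challenge.status != STATUS_OK:
        return None
    expected = keyed_mac(psk, b"responder", hello.cluster_id, hello.nonce, challenge.nonce)
    if not hmac.compare_digest(expected, challenge.mac):
        return None
    mac = keyed_mac(psk, b"initiator", hello.cluster_id, hello.nonce, challenge.nonce)
    return ReplicaFinish(mac)


def verify_finish(psk: bytes, hello: ReplicaHello, challenge: ReplicaChallenge,
                  finish: ReplicaFinish) -> bool:
    """Responder: verify the initiator's MAC in constant time."""
    expected = keyed_mac(psk, b"initiator", hello.cluster_id, hello.nonce, challenge.nonce)
    return hmac.compare_digest(expected, finish.mac)


def handshake(initiator_psk: bytes, responder_psk: bytes, cluster_id: bytes = b"cluster-1") -> bool:
    """Run both sides in-process; True only if each side accepts the other."""
    hello = make_hello(cluster_id, replica_id=1)
    challenge = answer_hello(responder_psk, cluster_id, hello)
    finish = answer_challenge(initiator_psk, hello, challenge)
    if finish is None:
        return False
    return verify_finish(responder_psk, hello, challenge, finish)
```

Both MACs cover both nonces, so neither side can replay an old transcript, and the distinct labels keep a responder MAC from being reflected back as an initiator MAC. As in the description, a successful run shows only that both peers hold the same key; `replica_id` is carried but not authenticated per replica.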

Configured under [cluster.auth] (enabled, shared_secret). Enabling it,
and adding the consensus-plane commands, is a coordinated-restart change
(cluster_id is derived from the cluster name).
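
A configuration fragment for this section might look like the following. Only the table name `[cluster.auth]` and the keys `enabled` and `shared_secret` come from the description above; the expected encoding and length of `shared_secret` are not stated here, so the value is a placeholder.

```toml
[cluster.auth]
enabled = true
# Placeholder: the key's expected encoding and length are not given in this PR text.
shared_secret = "<pre-shared cluster key>"
```

Because the handshake is mutual, every replica presumably needs the same value, which is consistent with the coordinated-restart note above.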

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label Jun 5, 2026
@codecov

codecov Bot commented Jun 5, 2026


Codecov Report

❌ Patch coverage is 83.12830% with 96 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.31%. Comparing base (be0c997) to head (4342353).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/message_bus/src/replica/listener.rs | 75.69% | 31 Missing and 4 partials ⚠️ |
| core/server-ng/src/bootstrap.rs | 0.00% | 28 Missing ⚠️ |
| core/message_bus/src/connector.rs | 85.43% | 20 Missing and 2 partials ⚠️ |
| core/message_bus/src/replica/auth.rs | 93.19% | 9 Missing and 1 partial ⚠️ |
| core/message_bus/src/replica/io.rs | 83.33% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3425      +/-   ##
============================================
+ Coverage     74.29%   74.31%   +0.02%     
  Complexity      943      943              
============================================
  Files          1245     1247       +2     
  Lines        121687   122232     +545     
  Branches      97959    98530     +571     
============================================
+ Hits          90403    90841     +438     
- Misses        28328    28389      +61     
- Partials       2956     3002      +46     

| Components | Coverage Δ |
|---|---|
| Rust Core | 75.45% <83.12%> (+0.06%) ⬆️ |
| Java SDK | 58.44% <ø> (ø) |
| C# SDK | 69.41% <ø> (-0.52%) ⬇️ |
| Python SDK | 81.06% <ø> (ø) |
| PHP SDK | 83.57% <ø> (ø) |
| Node SDK | 91.26% <ø> (-0.10%) ⬇️ |
| Go SDK | 40.25% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|---|---|
| core/binary_protocol/src/consensus/command.rs | 100.00% <100.00%> (ø) |
| core/binary_protocol/src/consensus/header.rs | 79.68% <ø> (ø) |
| core/configs/src/server_config/cluster.rs | 100.00% <100.00%> (ø) |
| core/configs/src/server_config/defaults.rs | 77.65% <100.00%> (+0.05%) ⬆️ |
| core/configs/src/server_config/http.rs | 77.58% <100.00%> (+10.91%) ⬆️ |
| core/configs/src/server_config/system.rs | 95.17% <100.00%> (+0.31%) ⬆️ |
| core/configs/src/server_config/validators.rs | 75.89% <100.00%> (+1.75%) ⬆️ |
| core/message_bus/src/replica/io.rs | 82.98% <83.33%> (+<0.01%) ⬆️ |
| core/message_bus/src/replica/auth.rs | 93.19% <93.19%> (ø) |
| core/message_bus/src/connector.rs | 89.33% <85.43%> (-4.99%) ⬇️ |

... and 2 more

... and 30 files with indirect coverage changes


@hubcio hubcio force-pushed the auth-in-cluster branch 2 times, most recently from 97f4d4d to 67dc98d on June 5, 2026 11:31
The server-ng replica port (tcp_replica) was plaintext: any TCP peer
that learned the cluster id could inject VSR frames or register as an
arbitrary replica. mTLS does not fit - the replica conn is a dup'd
plaintext fd round-robined across shards, state rustls cannot carry.

Authenticate with a pre-shared cluster key and a 3-message mutual
BLAKE3 keyed-MAC handshake (ReplicaHello / ReplicaChallenge /
ReplicaFinish) over the reserved GenericHeader bytes: no new typed
header, the stream stays a dupable plaintext fd. These are dedicated
Command2 variants, not a Ping/Pong squat (left free for a future VSR
liveness ping). ReplicaChallenge carries a status field, so a reject is
the same frame with a nonzero status and the finish is identified by
command, not position. The MAC proves PSK possession (cluster
membership, not per-replica identity - the registry still trusts the
announced id, so keep the port on a trusted boundary). No secret - PSK,
encryption key, or JWT keys - serializes to the on-disk config snapshot.

Configured under [cluster.auth] (enabled, shared_secret). Enabling it,
and adding the consensus-plane commands, is a coordinated-restart change
(cluster_id is derived from the cluster name, single-node included).
Comment thread core/message_bus/src/replica/listener.rs
@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels Jun 8, 2026
@hubcio

hubcio commented Jun 10, 2026

Contributor Author

/ready

@github-actions github-actions Bot added S-waiting-on-review PR is waiting on a reviewer and removed S-waiting-on-author PR is waiting on author response labels Jun 10, 2026
@numinnex

numinnex commented Jun 10, 2026

Contributor

OK, the idea is sound and looks good. There is one piece that I think is worth mentioning: replica_id generation.

Currently we supply the replica_id through a CLI argument. That is fine for "seed" cluster values (e.g. a perfect-information first bootstrap), but it falls flat in cases where we would like to have dynamic cluster counts.

One way to address this issue would be to keep the initial seed, but on the first bootstrap (when the metadata log is empty, so the cluster is fresh), write that seed configuration into the metadata log with op=0. The replica_id would then be generated at runtime rather than supplied (the one coming from the CLI arg would be used as an index into the replica set, so the initial seed configuration has granular control over all of the nodes, as required for determinism).
After the initial configuration is written to the log and resolved, we can go through the challenge protocol with those nodes. I omitted one important detail, which is the ReplicaJoin command: this is the command that will assign the replica_id (it has to be written to the log and replicated).
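
To make the proposal concrete, here is a rough single-process sketch of the idea. Everything in it is hypothetical: the `MetadataLog` class, the entry shapes, the "next free integer" assignment rule, and the function names are illustrative only, and replication of the log (which the proposal requires) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataLog:
    """Toy in-memory log; a list index plays the role of the op number."""
    entries: list = field(default_factory=list)

    def append(self, entry: dict) -> int:
        self.entries.append(entry)
        return len(self.entries) - 1


def bootstrap(log: MetadataLog, seed_addresses: list) -> None:
    """On a fresh cluster (empty log), persist the seed configuration as op 0."""
    if log.entries:
        return  # log already has history, so the seed was written before
    # The CLI-supplied value acts as an index into the seed replica set.
    replicas = {index: addr for index, addr in enumerate(seed_addresses)}
    log.append({"kind": "SeedConfig", "replicas": replicas})


def current_replicas(log: MetadataLog) -> dict:
    """Replay the log to obtain the current replica_id -> address map."""
    replicas = {}
    for entry in log.entries:
        if entry["kind"] == "SeedConfig":
            replicas.update(entry["replicas"])
        elif entry["kind"] == "ReplicaJoin":
            replicas[entry["replica_id"]] = entry["address"]
    return replicas


def replica_join(log: MetadataLog, address: str) -> int:
    """Assign the next free replica_id and record the join in the log."""
    replica_id = max(current_replicas(log), default=-1) + 1
    log.append({"kind": "ReplicaJoin", "replica_id": replica_id, "address": address})
    return replica_id
```

Since ids are derived purely by replaying the log, any replica that has the same log prefix computes the same assignment, which is the determinism property the comment asks for.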

@hubcio hubcio merged commit 7bf1a24 into master Jun 10, 2026
160 of 162 checks passed
@hubcio hubcio deleted the auth-in-cluster branch June 10, 2026 14:20
@github-actions github-actions Bot removed the S-waiting-on-review PR is waiting on a reviewer label Jun 10, 2026

3 participants